9. Clustering
The starter code can be found in k_means/k_means_cluster.py, which reads in the email + financial (E+F) dataset and gets us ready for clustering. You'll start by performing k-means based on just two financial features. Take a look at the code and determine which features it uses for clustering.
Run the code, which will create a scatterplot of the data. Think a little bit about what clusters you would expect to arise if 2 clusters are created.
#k_means_cluster.py
%matplotlib inline
#!/usr/bin/python
"""
Skeleton code for k-means clustering mini-project.
"""
import pickle
import numpy
import matplotlib.pyplot as plt
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
def Draw(pred, features, poi, mark_poi=False, name="image.png", f1_name="feature 1", f2_name="feature 2"):
    """ some plotting code designed to help you visualize your clusters """
    ### plot each cluster with a different color--add more colors for
    ### drawing more than five clusters
    colors = ["b", "c", "k", "m", "g"]
    for ii, pp in enumerate(pred):
        plt.scatter(features[ii][0], features[ii][1], color = colors[pred[ii]])
    ### if you like, place red stars over points that are POIs (just for funsies)
    if mark_poi:
        for ii, pp in enumerate(pred):
            if poi[ii]:
                plt.scatter(features[ii][0], features[ii][1], color="r", marker="*")
    plt.xlabel(f1_name)
    plt.ylabel(f2_name)
    plt.savefig(name)
    plt.show()
### load in the dict of dicts containing all the data on each person in the dataset
data_dict = pickle.load( open("../17. Final Project/final_project_dataset.pkl", "r") )
### there's an outlier--remove it!
data_dict.pop("TOTAL", 0)
### the input features we want to use
### can be any key in the person-level dictionary (salary, director_fees, etc.)
feature_1 = "salary"
feature_2 = "exercised_stock_options"
poi = "poi"
features_list = [poi, feature_1, feature_2]
data = featureFormat(data_dict, features_list )
poi, finance_features = targetFeatureSplit( data )
### in the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2 in finance_features:
    plt.scatter( f1, f2 )
plt.show()
### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
### rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="clusters.pdf", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print "no predictions object named pred found, no clusters to plot"
What features will your clustering algorithms use?
The code above uses salary and exercised_stock_options.
Deploy clustering
Scikit-learn: see the documentation for sklearn.cluster.KMeans
#k_means_cluster.py
%matplotlib inline
#!/usr/bin/python
"""
Skeleton code for k-means clustering mini-project.
"""
import pickle
import numpy
import matplotlib.pyplot as plt
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
def Draw(pred, features, poi, mark_poi=False, name="image.png", f1_name="feature 1", f2_name="feature 2"):
    """ some plotting code designed to help you visualize your clusters """
    ### plot each cluster with a different color--add more colors for
    ### drawing more than five clusters
    colors = ["b", "c", "k", "m", "g"]
    for ii, pp in enumerate(pred):
        plt.scatter(features[ii][0], features[ii][1], color = colors[pred[ii]])
    ### if you like, place red stars over points that are POIs (just for funsies)
    if mark_poi:
        for ii, pp in enumerate(pred):
            if poi[ii]:
                plt.scatter(features[ii][0], features[ii][1], color="r", marker="*")
    plt.xlabel(f1_name)
    plt.ylabel(f2_name)
    plt.savefig(name)
    plt.show()
### load in the dict of dicts containing all the data on each person in the dataset
data_dict = pickle.load( open("../17. Final Project/final_project_dataset.pkl", "r") )
### there's an outlier--remove it!
data_dict.pop("TOTAL", 0)
### the input features we want to use
### can be any key in the person-level dictionary (salary, director_fees, etc.)
feature_1 = "salary"
feature_2 = "exercised_stock_options"
poi = "poi"
features_list = [poi, feature_1, feature_2]
data = featureFormat(data_dict, features_list )
poi, finance_features = targetFeatureSplit( data )
### in the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2 in finance_features:
    plt.scatter( f1, f2 )
plt.show()
### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(finance_features)
pred = kmeans.predict(finance_features)
### rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="clusters.pdf", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print "no predictions object named pred found, no clusters to plot"
Once plotted, the clustering makes some sense: it mostly separates a few outliers from the rest of the points ;)
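To check the "outliers" impression beyond eyeballing the plot, one option is to count how many points land in each cluster and list the members of the smallest one. The sketch below is only an illustration: the labels in pred are made up rather than taken from the Enron data, and smallest_label and members are names introduced here. With the real notebook you would pass list(pred) from the cell above instead.

```python
from collections import Counter

# Hypothetical cluster labels, standing in for the output of kmeans.predict(...)
pred = [0, 0, 1, 0, 0, 0, 1, 0]

# Number of points assigned to each cluster label
cluster_sizes = Counter(pred)
for label, size in sorted(cluster_sizes.items()):
    print("cluster %d: %d points" % (label, size))

# Indices of the points in the smallest cluster (the outlier candidates)
smallest_label = min(cluster_sizes, key=cluster_sizes.get)
members = [i for i, p in enumerate(pred) if p == smallest_label]
print(members)
```

A very small cluster is a hint, not proof, that k-means has isolated a handful of extreme points.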
Clustering with 3 Features
Add a third feature to features_list, "total_payments". Now rerun clustering, using 3 input features instead of 2 (obviously we can still only visualize the original 2 dimensions). Compare the plot with the clusterings to the one you obtained with 2 input features. Do any points switch clusters? How many? This new clustering, using 3 features, couldn't have been guessed by eye--it was the k-means algorithm that identified it.
(You'll need to change the code that makes the scatterplot to accommodate 3 features instead of 2; see the comments in the starter code for instructions on how to do this.)
#k_means_cluster.py
%matplotlib inline
#!/usr/bin/python
"""
Skeleton code for k-means clustering mini-project.
"""
import pickle
import numpy
import matplotlib.pyplot as plt
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
def Draw(pred, features, poi, mark_poi=False, name="image.png", f1_name="feature 1", f2_name="feature 2"):
    """ some plotting code designed to help you visualize your clusters """
    ### plot each cluster with a different color--add more colors for
    ### drawing more than five clusters
    colors = ["b", "c", "k", "m", "g"]
    for ii, pp in enumerate(pred):
        plt.scatter(features[ii][0], features[ii][1], color = colors[pred[ii]])
    ### if you like, place red stars over points that are POIs (just for funsies)
    if mark_poi:
        for ii, pp in enumerate(pred):
            if poi[ii]:
                plt.scatter(features[ii][0], features[ii][1], color="r", marker="*")
    plt.xlabel(f1_name)
    plt.ylabel(f2_name)
    plt.savefig(name)
    plt.show()
### load in the dict of dicts containing all the data on each person in the dataset
data_dict = pickle.load( open("../17. Final Project/final_project_dataset.pkl", "r") )
### there's an outlier--remove it!
data_dict.pop("TOTAL", 0)
### the input features we want to use
### can be any key in the person-level dictionary (salary, director_fees, etc.)
feature_1 = "salary"
feature_2 = "exercised_stock_options"
feature_3 = "total_payments"
poi = "poi"
features_list = [poi, feature_1, feature_2, feature_3]
data = featureFormat(data_dict, features_list )
poi, finance_features = targetFeatureSplit( data )
### in the "clustering with 3 features" part of the mini-project,
### you'll want to change this line to
### for f1, f2, _ in finance_features:
### (as it's currently written, the line below assumes 2 features)
for f1, f2, _ in finance_features:
    plt.scatter( f1, f2 )
plt.show()
### cluster here; create predictions of the cluster labels
### for the data and store them to a list called pred
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(finance_features)
pred = kmeans.predict(finance_features)
### rename the "name" parameter when you change the number of features
### so that the figure gets saved to a different file
try:
    Draw(pred, finance_features, poi, mark_poi=False, name="clusters.pdf", f1_name=feature_1, f2_name=feature_2)
except NameError:
    print "no predictions object named pred found, no clusters to plot"
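Counting the points that switch clusters by comparing two plots is error-prone, so here is a small sketch of doing it in code. It assumes both runs label the same people in the same order, which is worth checking because featureFormat may keep a different set of rows once a third feature is added. The function count_switched and both label lists are hypothetical, not output from the Enron data.

```python
def count_switched(labels_a, labels_b):
    """Count points whose cluster differs between two 2-cluster labelings.

    K-means label numbers are arbitrary (cluster 0 in one run can be
    cluster 1 in the next), so both label matchings are tried and the
    smaller number of disagreements is returned.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    mismatches = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return min(mismatches, len(labels_a) - mismatches)


# Made-up labels for eight points, clustered with 2 and then 3 features
pred_2_features = [0, 0, 0, 1, 1, 0, 1, 0]
pred_3_features = [0, 0, 0, 0, 1, 0, 0, 0]

switched = count_switched(pred_2_features, pred_3_features)
print(switched)
```

The min(...) trick only covers the two-cluster case used here; with more clusters the labels would need a proper matching step first.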
Now 4 points have switched clusters, and only one outlier remains in the second cluster.
Visualizing in 3D could give us a hint as to why that point is an outlier.
from mpl_toolkits.mplot3d import Axes3D  # importing this registers the "3d" projection
### create a single 3D axes (plt.subplots() would add an unused 2D axes as well)
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.view_init(elev=30, azim=120)
ax.set_xlabel('salary')
ax.set_ylabel('exercised_stock_options')
ax.set_zlabel('total_payments')
for f1, f2, f3 in finance_features:
    ax.scatter( f1, f2, f3)
plt.show()
What are the maximum and minimum values taken by the 'exercised_stock_options' feature used in this example?
(NB: if you look at finance_features, there are some "NaN" values that have been cleaned away and replaced with zeroes--so while those might look like the minima, it's a bit deceptive because they're more like points for which we don't have information, and just have to put in a number. So for this question, go back to data_dict and look for the maximum and minimum numbers that show up there, ignoring all the "NaN" entries.)
eso_list = []
for k,v in data_dict.iteritems():
    eso = v['exercised_stock_options']
    if eso != 'NaN':
        eso_list.append(eso)
print max(eso_list)
print min(eso_list)
What are the maximum and minimum values taken by 'salary'?
sal_list = []
for k,v in data_dict.iteritems():
    sal = v['salary']
    if sal != 'NaN':
        sal_list.append(sal)
print max(sal_list)
print min(sal_list)
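The two cells above differ only in the feature name, so the same check can be wrapped in one helper. This is a sketch under assumptions: feature_range is a name introduced here, and toy_dict is a tiny made-up stand-in for data_dict with the same "NaN" convention. It avoids iteritems and the print statement so that it runs under both Python 2 and Python 3.

```python
def feature_range(data, feature):
    """Return (minimum, maximum) of a feature, skipping "NaN" entries."""
    values = [person[feature] for person in data.values()
              if person[feature] != "NaN"]
    return min(values), max(values)


# Made-up records, not real Enron values
toy_dict = {
    "PERSON A": {"salary": 1000, "exercised_stock_options": "NaN"},
    "PERSON B": {"salary": "NaN", "exercised_stock_options": 5000},
    "PERSON C": {"salary": 250, "exercised_stock_options": 70000},
}

salary_min, salary_max = feature_range(toy_dict, "salary")
eso_min, eso_max = feature_range(toy_dict, "exercised_stock_options")
print("salary: %d to %d" % (salary_min, salary_max))
print("exercised_stock_options: %d to %d" % (eso_min, eso_max))
```

In the notebook, calling feature_range(data_dict, "salary") should give the same two numbers as the cell above, provided every record has the key.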